Resampling the Peak

Authors

  • Alexia Zoumpoulaki
  • Abdulmajeed Alsufyani
  • Howard Bowman
Abstract

Resampling techniques are used widely within the ERP community to assess statistical significance, especially in the deception detection literature. Here we argue that, because of statistical bias, bootstrap should not be used in combination with methods like peak-to-peak. Instead, permutation tests provide a more appropriate alternative.

Keywords: bootstrap, permutation, significance testing, ERP, deception detection

Resampling the Peak, some Do's and Don'ts

As researchers, we are typically interested in demonstrating a difference between experimental conditions. This is usually done by rejecting the null hypothesis, which asserts that any difference in the sample datasets is the result of hypothesis-irrelevant (background) variation in the data. The process involves stating the experimental hypothesis, identifying the competing (null) hypothesis, choosing and computing an appropriate statistic, determining the frequency distribution of the statistic under the null hypothesis, and finally making a decision based on this distribution (Good, 2005), e.g. establishing a difference between two condition means using a t-test. But t-tests assume that the sampling distribution is normal (Gaussian). As this is not always the case, resampling techniques that avoid assumptions about the underlying distribution are often employed.

Two of the most popular resampling methods are bootstrapping and permutation tests (Manly, 2006). When comparing two observed samples of size m and n, bootstrapping involves randomly resampling m data points with replacement from the first observed sample and n from the second, and calculating the statistic for the new samples. Repeating this procedure many times (>1000) allows one to approximate the statistic's distribution. Inferences can then be made from this distribution based on its shape, center and spread. Bootstrap distributions are mainly used to calculate confidence intervals for a statistic (Hesterberg, Moore, Monaghan, Clipson, & Epstein, 2005), although some have also used them to reject a null hypothesis at an α level by showing that the interval with probability 1-α does not contain the hypothesized null value of the statistic.

Permutation tests allow one to generate an approximate null hypothesis distribution of the statistic. This involves randomly exchanging labels between the two data sets of observed values (one for each condition) and calculating the statistic for each new sample. Again, repeating the process many times allows one to approximate the distribution of interest. The null hypothesis can be rejected at an α level if the true observed value is not contained within the interval with probability 1-α (Blair & Karniski, 1993).

Importantly, both methods are used in the ERP literature to reject null hypotheses.
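The two procedures described above can be sketched in a few lines. This is a minimal illustration, not the code used in any of the cited studies: the function names, the percentile form of the bootstrap interval, the two-sided permutation p-value, and the difference of means as test statistic are all assumptions made for the example.

```python
import numpy as np


def bootstrap_ci(x, y, stat, n_resamples=2000, alpha=0.05, rng=None):
    """Percentile bootstrap interval for stat(x, y) (assumed interval type)."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float)
    boot = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each condition separately, with replacement,
        # keeping the original sample sizes m and n.
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        boot[i] = stat(xb, yb)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def permutation_p_value(x, y, stat, n_resamples=2000, rng=None):
    """Two-sided permutation p-value for stat(x, y).

    Assumes the statistic is centred on zero under the null hypothesis.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = stat(x, y)
    pooled = np.concatenate([x, y])
    null = np.empty(n_resamples)
    for i in range(n_resamples):
        # Exchange condition labels by shuffling the pooled observations.
        perm = rng.permutation(pooled)
        null[i] = stat(perm[: x.size], perm[x.size:])
    # Count the observed value itself so the p-value is never exactly zero.
    return float((1 + np.sum(np.abs(null) >= abs(observed))) / (n_resamples + 1))


def mean_difference(x, y):
    return float(np.mean(x) - np.mean(y))


if __name__ == "__main__":
    gen = np.random.default_rng(0)
    a = gen.normal(0.0, 1.0, 30)
    b = gen.normal(0.5, 1.0, 30)
    print("bootstrap 95% interval:", bootstrap_ci(a, b, mean_difference, rng=1))
    print("permutation p-value:", permutation_p_value(a, b, mean_difference, rng=2))
```

Note the structural difference: the bootstrap resamples within each condition and so approximates the distribution of the statistic itself, whereas the permutation test pools the conditions and so approximates the distribution under the null hypothesis.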
For example, Blair and Karniski (1993) advocate permutation tests, while in the ERP lie detection literature bootstrap tests are routinely used (Rosenfeld, Miller, Rao, & Soskins, 2001), although Bowman et al. (2013, 2014) are exceptions. This letter explores the consequences of this choice. In particular, although both resampling techniques are based on random theory, there is a clear difference between them. The bootstrap distribution is an approximation of the distribution of the statistic, while the permutation distribution is an approximation of the null hypothesis distribution (Hesterberg et al., 2005). This focus on the difference between the generated distributions would largely be of esoteric interest if the derived p-values were effectively the same. However, this is not the case for certain measures, and in particular for the maximum, which is our point of focus.

The problem results from the fact that bootstrap underestimates the maximum. One can find mathematical proofs of why bootstrap fails to approximate the maximum, as well as other statistics that are on the boundary of the parameter space (Bickel & Freedman, 1989; Andrews, 2000; Abrevaya & Huang, 2005; Lehmann & Romano, 2006), but a simple example helps to demonstrate this. Suppose we took a sample of size 10 from a Gaussian distribution. The sample set (rounded to the third decimal) is A = {0.234, 3.488, -0.267, -0.244, -2.177, -0.405, -1.208, 0.229, -0.738, 0.288}, with a sample max of 3.488. If we were to create bootstrap samples from this observed sample, each one would fail to include the maximum value with a probability of about 0.35:

$$P(\text{not selecting max}) = \left(1 - \frac{1}{n}\right)^{n} = \left(1 - \frac{1}{10}\right)^{10} \approx 0.35$$

As the sample has a large variance (from the max), the rest of the values are far from the max, and the bias (the distance of the mean of the bootstrap distribution from the original statistic) is going to be large.
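The probability above, and the direction of the resulting bias, can be checked numerically with the sample A. The sketch below is illustrative only: the random seed and the use of NumPy are assumptions, and because the bootstrap mean depends on the random draws and on implementation details, the numbers it prints need not match any figures quoted in this letter.

```python
import numpy as np

# The observed sample A from the text.
A = np.array([0.234, 3.488, -0.267, -0.244, -2.177,
              -0.405, -1.208, 0.229, -0.738, 0.288])
n = A.size

# Probability that a bootstrap resample of size n never draws the maximum:
# each of the n draws misses it with probability (1 - 1/n).
p_miss = (1 - 1 / n) ** n

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n_resamples = 1000

# Each row is one bootstrap resample of A (drawn with replacement).
resamples = rng.choice(A, size=(n_resamples, n), replace=True)
boot_max = resamples.max(axis=1)

boot_mean = boot_max.mean()
bias = boot_mean - A.max()                 # negative: the max is underestimated
frac_missing = np.mean(boot_max < A.max())  # empirical counterpart of p_miss

print(f"P(not selecting max) = {p_miss:.3f}")
print(f"fraction of resamples missing the max = {frac_missing:.3f}")
print(f"mean of bootstrapped max = {boot_mean:.3f}, bias = {bias:.3f}")
```

Because a resampled maximum can never exceed the observed maximum, every resample that misses it pulls the bootstrap mean downwards, and nothing can pull it back up.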
More specifically, the mean of the bootstrap distribution is 2.585 and the bias is -0.903 (Bias = 2.585 - 3.488). It is clear that in this case the bootstrap distribution underestimates the maximum value of the sample, and that this underestimation is large with respect to the dispersion of the sampling distribution.

One might argue that bootstrapping very small observed samples (such as in this example) is not advisable. To counter this concern, we generated 1000 observed samples of size 1000 from a Gaussian distribution, and for each one we generated the bootstrap distribution of the maximum. We found that the average bias was $\overline{\text{Bias}} = -0.13$, that the average variance from the max was $\overline{\text{Var}_{\max}} = 11.7$, and that the variance from the maximum in the observed sample was correlated with the bias: $\text{Cor}(\text{Var}_{\max}, \text{Bias}) = -0.814$. That is, the greater the variance of the original sample from its max, the further the bootstrap mean falls below the sample max, and thus the bigger the bootstrap bias.

Figure 1 shows the statistic distribution as calculated for 1000 generated samples from a normal distribution for the mean, the max, the correlation coefficient, and the difference in max between two such samples. The mean of the bootstrap distribution from each one is then superimposed (all bootstrap distributions discussed in this letter are generated from 1000 resamplings). Bootstrap accurately approximates the distribution of means ($\overline{\text{Bias}} = 0.001$) and correlation coefficients ($\overline{\text{Bias}} = 0.0008$). Importantly, although the bias for the difference of two maxima is very small ($\overline{\text{Bias}} = 0.003$), the distribution of the difference between the bootstrapped maxima is narrower than the sampling distribution. This should be expected, as bootstrap underestimates larger maxima more than smaller ones, since the variance from the maximum in a sample tends to increase with the maximum.
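The large-sample simulation of the bootstrapped maximum described above can be sketched as follows. Two assumptions are made here and are not taken from the text. First, "variance from the max" is interpreted as the mean squared deviation of the observations from the sample maximum. Second, instead of drawing 1000 Monte Carlo resamples per sample, the sketch uses the closed-form mean of the bootstrap distribution of the maximum, which follows from $P(\max{}^{*} \le x_{(k)}) = (k/n)^{n}$ for the k-th smallest observation.

```python
import numpy as np


def expected_bootstrap_max(x):
    """Exact mean of the bootstrap distribution of max(x).

    A resample of size n has maximum <= the k-th smallest value exactly when
    all n draws fall among the k smallest values: probability (k / n) ** n.
    """
    xs = np.sort(np.asarray(x, float))
    n = xs.size
    cdf = (np.arange(1, n + 1) / n) ** n
    weights = np.diff(cdf, prepend=0.0)   # P(bootstrap max == k-th smallest)
    return float(np.sum(weights * xs))


def simulate(n_samples=1000, n=1000, seed=0):
    """Return (mean bias, mean variance-from-max, their correlation)."""
    rng = np.random.default_rng(seed)     # arbitrary seed (assumption)
    bias = np.empty(n_samples)
    var_max = np.empty(n_samples)
    for i in range(n_samples):
        x = rng.standard_normal(n)
        bias[i] = expected_bootstrap_max(x) - x.max()
        # Assumed definition: mean squared deviation from the sample maximum.
        var_max[i] = np.mean((x - x.max()) ** 2)
    corr = np.corrcoef(var_max, bias)[0, 1]
    return float(bias.mean()), float(var_max.mean()), float(corr)


if __name__ == "__main__":
    mean_bias, mean_var_max, corr = simulate()
    print(f"mean bias = {mean_bias:.3f}")
    print(f"mean variance from max = {mean_var_max:.2f}")
    print(f"Cor(var_max, bias) = {corr:.3f}")
```

Using the exact expectation removes the Monte Carlo noise of the inner resampling loop, so any remaining variability comes only from the generated samples themselves.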
Accordingly, more extreme differences of maxima (whether positive or negative) are pushed towards zero, since they respectively include one large maximum (which is underestimated a lot) and one small maximum (which is underestimated much less). A summary of the generated distributions is given in Table 1.

Efron proposed bootstrap in 1979 (Efron, 1979), and since then its strengths and weaknesses have become widely known. But some of these problems have not been recognized in the ERP setting. For example, in 1989 Wasserman and Bockenholt described how bootstrapping could be used to make inferences about guilt from data collected in an ERP deception detection experiment (Farwell & Donchin, 1986). They used the difference between two correlation coefficients: that between the Guilty knowledge and Task ERPs, and that between the Guilty knowledge and Irrelevant ERPs. They proposed statistical inference based upon whether the null hypothesis (a difference of 0) was included in the 95% confidence interval of the bootstrapped difference of correlation coefficients (Wasserman & Bockenholt, 1989). This method gained popularity and was used in the deception detection paradigm to compare the P300 component between conditions (Farwell & Donchin, 1991).

But many considered that the correlation coefficient was not the most appropriate measure, since the P300 for the guilty knowledge may not closely resemble that for the experimental task (Allen & Iacono, 1997). Instead, the difference of P300 amplitude between conditions was proposed (Rosenfeld et al., 2001). Amplitude was measured either with the peak-to-peak (p2p) or with the peak-to-baseline (p2b) method (Meixner & Rosenfeld, 2010; Hu & Rosenfeld, 2012). But as both of these measurements are on the boundary of the parameter space (p2p = difference between maximum and minimum, and p2b = maximum from a baseline), bootstrap is, in fact, inappropriate.
Since, as just documented, bootstrap's underestimation increases with the extremity of the max/min, the distribution of the differences will be closer to zero than in the null distribution, resulting in a loss of statistical power. To illustrate this, we generated two EEG datasets, each consisting of noise in the spectrum of human EEG (the script used to generate the noise can be found at http://www.cs.bris.ac.uk/~rafal/phasereset/), and then extracted a p-value by bootstrapping the p2p measure (Meixner & Rosenfeld, 2010). We repeated this process 10000 times, with the results shown in Figure 2. The distribution of p-values is not uniform, as it should be given that arbitrary noise data sets are equally likely to fall in any percentile of the null hypothesis distribution. Instead, there is a bias away from extreme p-values and towards intermediate ones. In contrast, the p-values obtained from permutation tests on the same null samples are uniform. We also performed the same analysis on white noise data, with the same results.

In summary, bootstrap tests with p2p or p2b, as commonly used in the deception detection literature, are biased. Most significantly, the method will tend to push small p-values, which might otherwise be significant, up towards 0.5. This will induce an unnecessary loss of statistical power, suggesting that existing studies may have underestimated the effectiveness of their deception detection methods. Although permutation tests might have their limitations, permuting p2p or p2b measurements suffers no such bias and should be the inferential method of choice in this context.
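The null-data check described above can be sketched as follows, so that the shape of the two p-value distributions can be inspected directly. This is a simplified stand-in rather than the procedure used for Figure 2: it uses white noise instead of noise in the spectrum of human EEG, defines p2p as the plain maximum minus minimum of the trial-averaged waveform, and uses one-sided p-values. The trial counts, epoch length and all function names are assumptions made for the example.

```python
import numpy as np


def p2p(erp):
    """Peak-to-peak amplitude, simplified to max minus min of the average."""
    return erp.max() - erp.min()


def bootstrap_p(a, b, n_resamples, rng):
    """Fraction of bootstrap iterations in which p2p(a*) <= p2p(b*).

    a and b are (trials x time) arrays; trials are resampled with
    replacement within each condition and then averaged.
    """
    count = 0
    for _ in range(n_resamples):
        ia = rng.integers(0, a.shape[0], size=a.shape[0])
        ib = rng.integers(0, b.shape[0], size=b.shape[0])
        if p2p(a[ia].mean(axis=0)) <= p2p(b[ib].mean(axis=0)):
            count += 1
    return count / n_resamples


def permutation_p(a, b, n_resamples, rng):
    """Fraction of label permutations whose p2p difference >= the observed one."""
    observed = p2p(a.mean(axis=0)) - p2p(b.mean(axis=0))
    pooled = np.concatenate([a, b])
    n_a = a.shape[0]
    count = 0
    for _ in range(n_resamples):
        idx = rng.permutation(pooled.shape[0])   # exchange trial labels
        diff = p2p(pooled[idx[:n_a]].mean(axis=0)) - p2p(pooled[idx[n_a:]].mean(axis=0))
        if diff >= observed:
            count += 1
    return count / n_resamples


def null_p_values(n_datasets=200, n_trials=30, n_time=100,
                  n_resamples=200, seed=0):
    """p-values from both tests on pairs of pure white-noise datasets."""
    rng = np.random.default_rng(seed)            # arbitrary seed (assumption)
    boot = np.empty(n_datasets)
    perm = np.empty(n_datasets)
    for d in range(n_datasets):
        a = rng.standard_normal((n_trials, n_time))
        b = rng.standard_normal((n_trials, n_time))
        boot[d] = bootstrap_p(a, b, n_resamples, rng)
        perm[d] = permutation_p(a, b, n_resamples, rng)
    return boot, perm


if __name__ == "__main__":
    boot, perm = null_p_values()
    bins = np.linspace(0.0, 1.0, 11)
    print("bootstrap p-value counts per decile:  ", np.histogram(boot, bins)[0])
    print("permutation p-value counts per decile:", np.histogram(perm, bins)[0])
```

Under the null hypothesis a well-calibrated test should give roughly equal counts in every decile, so comparing the two histograms is a direct way to look for the clustering around intermediate p-values that the letter reports for the bootstrap.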

Similar resources

Confidence Regions for Spectral Peak Frequencies

A procedure is proposed to obtain confidence regions for spectral peak frequencies. The method is based on resampling the periodogram from the estimated spectrum in order to reestimate the spectrum and its peak frequency. We investigate the dependence of the results from the applied spectral estimator in three simulation studies and apply the method to tremor data.

Bootstrap and Jackknife Resampling Methods in the Survival Analysis of Patients with Thalassemia Major

Background and Objectives: A small sample size can influence the results of statistical analysis. A reduction in the sample size may happen due to different reasons, such as loss of information, i.e. existing missing value in some variables. This study aimed to apply bootstrap and jackknife resampling methods in survival analysis of thalassemia major patients. Methods: In this historical coh...

The Use of Monte-Carlo Simulations in Seismic Hazard Analysis in Tehran and Surrounding Areas

Probabilistic seismic hazard analysis is a technique for estimating the annual rate of exceedance of a specified ground motion at a site due to the known and suspected earthquake sources. A Monte-Carlo approach is utilized to estimate the seismic hazard at a site. This method uses numerous resampling of an earthquake catalog to construct synthetic catalogs to evaluate the ground motion hazard a...

Publication date: 2015